Method and system for high precision classification of large quantities of information described with multiple variables

ABSTRACT

The present invention provides a method and apparatus for classifying data that can be expressed with multiple variables by similarity, using a computer, with high accuracy and at high speed; a method for a computer to execute procedures for such classification; a program for executing the method; and a computer readable recording medium on which the program is recorded. An example of the method comprises the following steps (a) to (f), for classifying input vector data with high accuracy by nonlinear mapping using a computer:
     (a) inputting input vector data to a computer,  
     (b) setting initial neuron vectors,  
     (c) classifying an input vector into one of the neuron vectors,  
     (d) updating neuron vectors so as to have a similar structure to structures of input vectors classified into the neuron vector and input vectors classified into the neighborhood of the neuron vector,  
     (e) repeating step c and step d until a preset number of learning cycles is reached, and  
     (f) classifying an input vector into one of the neuron vectors and outputting a result.

BACKGROUND OF THE INVENTION

[0001] 1. Technical Field

[0002] The present invention relates to a method of classifying data that can be expressed with multiple variables by similarity using a computer with high accuracy and at high speed, an apparatus for classifying data that can be expressed with multiple variables by similarity using a computer with high accuracy and at high speed, and a computer readable recording medium on which is recorded a program for a computer to execute procedures for classifying data that can be expressed with multiple variables by similarity with high accuracy and high speed.

[0003] 2. Description of the related art

[0004] In recent years, with the rapid development of information technology, the amount of data available has become enormous, and the importance of selecting useful information from the data has become greater and greater. In particular, developing a technique that classifies data that can be expressed with multiple variables by similarity using a computer, with high accuracy and at high speed, is an important subject of research and development for selecting and retrieving useful information for industry.

[0005] The artificial neural network, an engineering field of the neurological sciences, originates in the neuron model proposed by McCulloch and Pitts [Bull. Math. Biophysics, 5, 115-133 (1943)]. The characteristic of this model is that the output of an excitatory/inhibitory state is simplified to 1 or 0, and the state is determined by the sum of stimuli from other neurons. Hebb published a hypothesis (the Hebb rule) whereby, in a case where transmitted stimuli cause an excitation state in a particular neuron, the connections between the neurons that contributed to the occurrence are enhanced, and the stimuli become easier to transmit [The Organization of Behavior, Wiley, 62 (1949)]. The idea that changes of connection weight bring plasticity to a neural network, leading to memory and learning, is a basic concept of artificial neural networks. Rosenblatt's Perceptron [Psychol. Rev., 65, 6, 386-408 (1958)] is used in various fields of classification problems, since classification can be performed correctly by increasing or decreasing the connection weights of pattern separators.

[0006] The self-organizing map (hereunder abbreviated to SOM) developed by Kohonen, which uses a competitive neural network, is used for recognition of images, sound, fingerprints and the like, and for the control of production processes of industrial goods [“Application of Self-Organizing Maps—two dimensional visualization of multidimensional information” (Authors: Heizo Tokutaka, Satoru Kishida, Kikuo Fujimura; Kaibundo Publishing Company; first published on Jul. 20th, 1999; ISBN 4-303-73230-3); “Self-Organizing Maps” (Author: T. Kohonen, translated by Heizo Tokutaka, Satoru Kishida, Kikuo Fujimura; Springer-Verlag Tokyo Co., Ltd.; published on Jun. 15th, 1996; ISBN 4-431-70700-X C3055)]. In recent years, as the genome information of various organisms has been decoded, a vast amount of information about life has been accumulated, and it is important to unravel the secrets of life from this information using computers in fields such as pharmaceutical development; the application of SOMs is accordingly booming.

[0007] The conventional Kohonen self-organization method (hereunder abbreviated to “conventional method”) comprises the following three steps.

[0008] Step 1: Initialize the vector on each neuron (referred to hereunder as a neuron vector) using random numbers.

[0009] Step 2: Select the neuron with the neuron vector closest to the input vector.

[0010] Step 3: Update the neuron vectors of the selected neuron and its neighboring neurons.

[0011] Step 2 and step 3 are repeated for the number of input vectors. This is defined as one learning cycle, and a specified number of learning cycles is performed. After learning, each input vector is classified as belonging to the neuron having the closest neuron vector. In Kohonen's SOM, nonlinear mapping can be performed from input vectors in a higher dimensional space to neurons arranged on a lower dimensional map, while maintaining their characteristics.
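For concreteness, the following is a minimal sketch of this conventional sequential algorithm in Python with NumPy, assuming a two-dimensional lattice, Euclidean distance as the similarity scale, and illustrative learning-rate and neighborhood schedules; the function name, lattice size and schedules are assumptions for illustration, not prescriptions of the conventional method.

    import numpy as np

    def sequential_som(X, I=10, J=10, T=100, seed=0):
        """Conventional (sequential) Kohonen SOM, sketched for illustration.

        X: (K, M) array of K input vectors of M dimensions.
        Returns the (I, J, M) array of trained neuron vectors.
        """
        rng = np.random.default_rng(seed)
        # Step 1: initialize neuron vectors using random numbers.
        W = rng.uniform(X.min(axis=0), X.max(axis=0), size=(I, J, X.shape[1]))
        for t in range(T):                          # learning cycles (epochs)
            alpha = max(0.01, 0.6 * (1 - t / T))    # illustrative schedule
            radius = max(0, 25 - t)                 # illustrative neighborhood
            for x in X:                             # one update per input vector
                # Step 2: select the neuron with the closest neuron vector.
                d = np.linalg.norm(W - x, axis=2)
                bi, bj = np.unravel_index(d.argmin(), d.shape)
                # Step 3: update the selected and neighboring neuron vectors.
                i0, i1 = max(0, bi - radius), min(I, bi + radius + 1)
                j0, j1 = max(0, bj - radius), min(J, bj + radius + 1)
                W[i0:i1, j0:j1] += alpha * (x - W[i0:i1, j0:j1])
        return W

Because W is modified inside the per-input loop, the resulting map depends on the order in which the input vectors are presented; this is the defect analyzed in the next paragraph.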

[0012] In this conventional method, since in step 2 and step 3 the updating of neuron vectors is performed each time an input vector is classified, an input vector presented later is discriminated more accurately. Therefore, there is a problem in that different self-organizing maps are created depending on the learning order of the input vectors. Furthermore, since random numbers are used in the initial neuron vector setting in step 1, the structure of the random numbers influences the self-organizing map obtained after learning. Therefore, there is a problem that factors other than the input vectors are reflected in the self-organizing map. Moreover, there are practical problems whereby, in step 1, since random numbers are used, when the initial values differ significantly from the structure of the input vectors a considerably long learning time is required, and also, in steps 2 and 3, since updating of the neuron vectors is performed for every input, the learning time becomes longer in proportion to the number of input vectors.

[0013] An object of the present invention is to solve the above-described problems. That is, to solve:

[0014] (1) a problem where, since updating of neuron vectors is performed each time an input vector is classified in step 2 and step 3, later input vectors are discriminated more accurately, and different self-organizing maps (SOMs) are created depending on the learning order of the input vectors, so that the same, reproducible SOM cannot be obtained,

[0015] (2) a problem where, since in the conventional method random numbers are used in the initial neuron vector setting in step 1, the structure of the random numbers influences the SOM obtained after learning, and thus factors other than the input vectors are reflected in the SOM, so that the structure of the input vectors cannot be reflected in the SOM accurately,

[0016] (3) a problem where, in the conventional method, since random numbers are used in step 1, when the initial values differ significantly from the structure of the input vectors a considerably long learning time is required, and

[0017] (4) a practical problem where, in the conventional method, since updating of neuron vectors is performed each time an input vector is classified in steps 2 and 3, the computing time becomes longer in proportion to the number of input vectors.

SUMMARY OF THE INVENTION

[0018] Regarding the problem described above in (1) of being unable to obtain the same, reproducible SOM, it has been shown that the problem can be solved by designing a batch-processing learning algorithm wherein “the individual neuron vectors are updated after all input vectors have been classified into neuron vectors”, and applying it in place of the sequential processing of the conventional method, wherein neuron vectors are updated each time one input vector is classified into the (initial) neuron vectors.

[0019] Regarding the problem described above in (2), in which the structure of the input vectors cannot be reflected accurately in a SOM, and the problem in (3), in which a considerably long learning time is required, it has been shown that the problems can be solved by replacing the conventional method of setting initial neuron vectors using random numbers with a method of setting initial neuron vectors by an unsupervised multivariate analysis technique that uses the distribution characteristics of the multidimensional input vectors in multidimensional space, such as principal component analysis, multidimensional scaling or the like.

[0020] Furthermore, regarding the practical problem described above in (4), whereby computing time becomes longer in proportion to the number of input vectors, it has been shown that the problem can be solved by applying a batch-learning algorithm instead of the sequential processing algorithm performed in the conventional method, and by parallel learning.

[0021] That is, a first embodiment of the present invention is a method comprising the following steps (a) to (f), for classifying input vector data with high accuracy by a nonlinear mapping method using a computer, and the steps are as follows:

[0022] (a) inputting input vector data to a computer,

[0023] (b) setting initial neuron vectors,

[0024] (c) classifying an input vector into one of the neuron vectors,

[0025] (d) updating neuron vectors so as to have a similar structure tostructures of input vectors classified into the neuron vector and inputvectors classified into the neighborhood of the neuron vector,

[0026] (e) repeating step c and step d until a preset number of learningcycles is reached, and

[0027] (f) classifying an input vector into one of the neuron vectors and outputting a result.

[0028] In the above-described method, the input vector data may be data of K input vectors (K is a positive integer of 3 or above) of M dimensions (M is a positive integer).

[0029] Furthermore, in the above-described method, the initial neuron vectors may be set by reflecting the distribution characteristics of the multidimensional input vectors in multidimensional space, obtained by an unsupervised multivariate analysis technique, in the arrangement or elements of the initial neuron vectors.

[0030] For the unsupervised multivariate analysis technique, it is possible to use principal component analysis, multidimensional scaling or the like.

[0031] For the method of classifying an input vector into one of the neuron vectors, it is possible to use a classification method based on similarity scaling, using a scale selected from the group consisting of distance, inner product, and direction cosine.

[0032] The above distance may be Euclidean distance or the like.

[0033] Furthermore, regarding the classification method in the above embodiment, it is also possible to classify input vectors into neuron vectors using a batch-learning algorithm.

[0034] Moreover, using a batch-learning algorithm, it is also possible to update the neuron vectors to a structure similar to the structures of the input vectors classified into the neuron vector and the input vectors classified into the neighborhood of the neuron vector.

[0035] The above processing may be performed using parallel computers.

[0036] Another embodiment of the present invention is a method comprising the following steps (a) to (f), for classifying input vector data with high accuracy by a nonlinear mapping method using a computer, and the steps are as follows:

[0037] (a) inputting K input vectors (K is a positive integer of 3 or more) x_(k) (here, k=1, 2, . . . , K) of M dimensions (M is a positive integer), represented by the following equation (1), to a computer,

x_(k)={x_(k1), x_(k2), . . . , x_(kM)}  (1)

[0038] (b) setting P initial neuron vectors W⁰_(i) (here, i=1, 2, . . . , P) arranged in a lattice of D dimensions (D is a positive integer), represented by the following equation (2),

W⁰_(i)=F{x₁, x₂, . . . , x_(K)}  (2)

[0039] (in which F{x₁, x₂, . . . , x_(K)} represents a conversion function for converting from the input vectors {x₁, x₂, . . . , x_(K)} to the initial neuron vectors)

[0040] (c) classifying the input vectors {x₁, x₂, . . . , x_(K)} after t (here, t is the number of the learning cycle, t=0, 1, 2, . . . , T) learning cycles into one of the P neuron vectors W^(t)_(1), W^(t)_(2), . . . , W^(t)_(P) arranged in a lattice of D dimensions, using similarity scaling,

[0041] (d) for each neuron vector W^(t)_(i), updating the neuron vector W^(t)_(i) so as to have a similar structure to the structures of the input vectors classified into the neuron vector and the input vectors x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(N_(i))(S_(i)) classified into the neighborhood of the neuron vector, by the following equation (3),

W^(t+1)_(i)=G(W^(t)_(i), x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(N_(i))(S_(i)))   (3)

[0042] [in which x^(t)_(n)(S_(i)) (n=1, 2, . . . , N_(i)) represents N_(i) vectors of M dimensions (M is a positive integer; N_(i) is the number of input vectors classified into neuron i and its neighboring neurons), and W^(t)_(i) represents P neuron vectors (t is the number of learning cycles, i=1, 2, . . . , P) arranged in a lattice of D dimensions (D is a positive integer); when the set of input vectors {x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(N_(i))(S_(i))} belonging to the lattice point where a specific neuron vector W^(t)_(i) is positioned and to its neighboring lattice points is designated S_(i), the above equation (3) is an equation to update the neuron vector W^(t)_(i) to the neuron vector W^(t+1)_(i)],

[0043] (e) repeating step (c) and step (d) until a preset number oflearning cycles T is reached, and

[0044] (f) classifying the input vectors {x₁, x₂, . . . , x_(K)} into one of W^(T)_(1), W^(T)_(2), . . . , W^(T)_(P) using similarity scaling, and outputting a result.

[0045] Another embodiment of the present invention is a method comprising the following steps (a) to (f) for classifying input vector data by nonlinear mapping with high accuracy using a computer, and the steps are as follows (a sketch of this embodiment in code is given after step (f)):

[0046] (a) inputting K (K is a positive integer of 3 or more) input vectors x_(k) (here, k=1, 2, . . . , K) of M dimensions (M is a positive integer), expressed by the following equation (4), to a computer,

x_(k)={x_(k1), x_(k2), ..., x_(kM)}  (4)

[0047] (b) setting P (P=I×J) initial neuron vectors W⁰_(ij) arranged in a two-dimensional (i,j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J) by the following equation (5), $\begin{matrix}{W_{ij}^{0} = {x_{ave} + {5\sigma_{1}\left\{ {{b_{1}\left( \frac{i - {I/2}}{I} \right)} + {b_{2}\left( \frac{j - {J/2}}{J} \right)}} \right\}}}} & (5)\end{matrix}$

[0048] [in which x_(ave) is the average value of the input vectors, b₁ and b₂ are the first principal component vector and the second principal component vector, respectively, obtained by principal component analysis of the input vectors {x₁, x₂, . . . , x_(K)}, and σ₁ denotes the standard deviation of the first principal component of the input vectors,]

[0049] (c) classifying the input vectors {x₁, x₂, . . . , x_(K)} after having been through t learning cycles (t is the number of learning cycles, t=0, 1, 2, . . . , T) into one of the P neuron vectors W^(t)_(1), W^(t)_(2), . . . , W^(t)_(P) arranged in the two-dimensional lattice, using similarity scaling,

[0050] (d) updating each neuron vector W^(t)_(ij) to W^(t+1)_(ij) by the following equations (6) and (7), $\begin{matrix}{W_{ij}^{t + 1} = {W_{ij}^{t} + {{\alpha (t)}\left( {\frac{\sum\limits_{x_{k} \in S_{ij}}x_{k}}{N_{ij}} - W_{ij}^{t}} \right)}}} & (6) \\{{\alpha (t)} = {\max \left\{ {0.01,{0.6\left( {1 - \frac{t}{T}} \right)}} \right\}}} & (7)\end{matrix}$

[0051] [in which W^(t)_(ij) represents the P (P=I×J) neuron vectors arranged on the two-dimensional (i,j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J) after t learning cycles, and the above equation (6) is an equation to update W^(t)_(ij) to W^(t+1)_(ij) so as to have a similar structure to the structures of the input vectors (x_(k)) classified into the neuron vector and the N_(ij) input vectors x^(t)_(1)(S_(ij)), x^(t)_(2)(S_(ij)), . . . , x^(t)_(N_(ij))(S_(ij)) classified into the neighborhood of the neuron vector; the term α(t) designates a learning coefficient (0<α(t)<1) for epoch t when the number of learning cycles is set to T epochs, and is expressed using a monotone decreasing function,]

[0052] (e) repeating step (c) and step (d) until a preset number oflearning cycles T is reached, and

[0053] (f) classifying the input vectors {x₁, x₂, . . . , x_(K)} into one of W^(T)_(1), W^(T)_(2), . . . , W^(T)_(P) using similarity scaling, and outputting a result.
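The following is a minimal NumPy sketch of this embodiment under stated assumptions: principal component analysis is done via singular value decomposition, equation (5) initializes the lattice, equations (6) and (7) perform the batch update, and the neighborhood is taken as the β(t)=max{0, 25−t} square described in the detailed description below. The function and variable names are illustrative, and the naive distance computation is written for clarity rather than speed.

    import numpy as np

    def batch_som(X, I=100, T=100):
        """Batch-learning SOM per equations (4)-(7); a sketch, not optimized.

        X: (K, M) array of K input vectors of M dimensions.
        """
        K, M = X.shape
        x_ave = X.mean(axis=0)
        # Principal component analysis of the centered input vectors.
        _, _, Vt = np.linalg.svd(X - x_ave, full_matrices=False)
        b1, b2 = Vt[0], Vt[1]              # first two principal component vectors
        z = (X - x_ave) @ Vt[:2].T         # principal component scores
        s1, s2 = z[:, 0].std(), z[:, 1].std()
        J = int(I * s2 / s1)               # lattice aspect follows the data spread
        # Equation (5): initial neuron vectors on the two-dimensional lattice.
        i = np.arange(1, I + 1)[:, None, None]
        j = np.arange(1, J + 1)[None, :, None]
        W = x_ave + 5 * s1 * (b1 * (i - I / 2) / I + b2 * (j - J / 2) / J)
        for t in range(T):
            alpha = max(0.01, 0.6 * (1 - t / T))       # equation (7)
            beta = max(0, 25 - t)                      # neighborhood radius
            # Step (c): classify ALL input vectors before any update.
            d = np.linalg.norm(X[:, None, None, :] - W[None], axis=3)
            bi, bj = np.unravel_index(d.reshape(K, -1).argmin(axis=1), (I, J))
            # Step (d), equation (6): move each W_ij toward the mean of S_ij.
            W_new = W.copy()
            for ii in range(I):
                for jj in range(J):
                    in_S = (np.abs(bi - ii) <= beta) & (np.abs(bj - jj) <= beta)
                    if in_S.any():
                        W_new[ii, jj] += alpha * (X[in_S].mean(axis=0) - W[ii, jj])
            W = W_new                                  # steps (c)-(d) repeated T times
        return W

Because the classification of all K input vectors is completed before any neuron vector changes, the result no longer depends on the order of the input vectors, and both the classification loop and the per-neuron update loop can be distributed over processors.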

[0054] Another embodiment of the present invention is a computer readable recording medium on which is recorded a program for performing the method shown in the above-described embodiments, which updates neuron vectors so as to have a similar structure to the structures of the input vectors classified into the neuron vector and the input vectors classified into the neighborhood of the neuron vector.

[0055] Here, the program recorded on the recording medium may be a program using a batch-learning algorithm.

[0056] Furthermore, the program recorded on the recording medium may be a program for performing the processing of the following equation (8).

W^(t+1)_(i)=G(W^(t)_(i), x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(N_(i))(S_(i)))   (8)

[0057] [in which x^(t)_(n)(S_(i)) (n=1, 2, . . . , N_(i)) represents N_(i) input vectors of M dimensions (M is a positive integer), out of the K input vectors (K is a positive integer of 3 or more), and W^(t)_(i) represents P neuron vectors (t is the number of learning cycles, i=1, 2, . . . , P) arranged in a lattice of D dimensions (D is a positive integer); when the set of input vectors {x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(N_(i))(S_(i))} belonging to the neighboring lattice points of the lattice point where a specific neuron vector W^(t)_(i) is positioned is designated S_(i), the above equation (8) is an equation to update the neuron vector W^(t)_(i) to the neuron vector W^(t+1)_(i).]

[0058] Furthermore, the program recorded on the recording medium may be a program for performing the processing of the following equations (9) and (10). $\begin{matrix}{W_{ij}^{t + 1} = {W_{ij}^{t} + {{\alpha (t)}\left( {\frac{\sum\limits_{x_{k} \in S_{ij}}x_{k}}{N_{ij}} - W_{ij}^{t}} \right)}}} & (9) \\{{\alpha (t)} = {\max \left\{ {0.01,{0.6\left( {1 - \frac{t}{T}} \right)}} \right\}}} & (10)\end{matrix}$

[0059] [in which W^(t)_(ij) represents the P (P=I×J) neuron vectors arranged in a two-dimensional (i,j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J) after t learning cycles, and the above equation (9) is an equation to update W^(t)_(ij) to W^(t+1)_(ij) so as to have a similar structure to the structures of the input vectors (x_(k)) classified into the neuron vector and the N_(ij) input vectors x^(t)_(1)(S_(ij)), x^(t)_(2)(S_(ij)), . . . , x^(t)_(N_(ij))(S_(ij)) classified into the neighborhood of the neuron vector. The term α(t) designates a learning coefficient (0<α(t)<1) for the t-th epoch when the number of learning cycles is set to T epochs, and is expressed using a monotone decreasing function.]

[0060] Furthermore, the abovementioned recording medium may be a computer readable recording medium on which is recorded a program for setting the initial neuron vectors in order to perform the abovementioned method.

[0061] Moreover, the recording medium is characterized in that the recorded program is a program for performing the processing of the following equation (11).

W⁰_(i)=F{x₁, x₂, . . . , x_(K)}  (11)

[0062] [in which W⁰_(i) represents P initial neuron vectors arranged in a lattice of D dimensions (D is a positive integer), i is one of 1, 2, . . . , P, and F{x₁, x₂, . . . , x_(K)} is a function for converting the K input vectors {x₁, x₂, . . . , x_(K)} to the initial neuron vectors.]

[0063] Furthermore, the recording medium is characterized in that the recorded program is a program for performing the processing of the following equation (12). $\begin{matrix}{W_{ij}^{0} = {x_{ave} + {5\sigma_{1}\left\{ {{b_{1}\left( \frac{i - {I/2}}{I} \right)} + {b_{2}\left( \frac{j - {J/2}}{J} \right)}} \right\}}}} & (12)\end{matrix}$

[0064] [in which W⁰_(ij) represents the P (P=I×J) initial neuron vectors arranged in a two-dimensional (i,j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J), x_(ave) is the average value of the K (K is a positive integer of 3 or above) input vectors {x₁, x₂, . . . , x_(K)} of M dimensions (M is a positive integer), b₁ and b₂ are the first principal component vector and the second principal component vector, respectively, obtained by principal component analysis of the input vectors {x₁, x₂, . . . , x_(K)}, and σ₁ is the standard deviation of the first principal component of the input vectors.]

[0065] Furthermore, this may also be a computer readable recording medium characterized in that the recorded program has a program for setting the initial neuron vectors for performing the above-described method, and a program for updating the neuron vectors to a structure similar to the structures of the input vectors classified into the neuron vector and the input vectors classified into the neighborhood of the neuron vector.

[0066] Moreover, it may also include a recording medium on which are recorded a program for performing the processing of the following equation (13) and a program for performing the processing of the following equation (14).

W⁰_(i)=F{x₁, x₂, . . . , x_(K)}  (13)

[0067] (in which W⁰_(i) represents P initial neuron vectors arranged in a lattice of D dimensions (D is a positive integer), i is one of 1, 2, . . . , P, and F{x₁, x₂, . . . , x_(K)} is a function for converting from the K (K is a positive integer of 3 or above) input vectors {x₁, x₂, . . . , x_(K)} of dimension M (M is a positive integer) to the initial neuron vectors)

W^(t+1)_(i)=G(W^(t)_(i), x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(N_(i))(S_(i)))   (14)

[0068] [in which x^(t)_(n)(S_(i)) (n=1, 2, . . . , N_(i)) represents N_(i) (N_(i) is the number of input vectors classified into neuron i and the neighboring neurons) input vectors of M dimensions (M is a positive integer), W^(t)_(i) represents P neuron vectors (t is the number of the learning cycle, i=1, 2, . . . , P) arranged in a lattice of D dimensions (D is a positive integer), and the above equation (14) is an equation to update W^(t)_(i) to W^(t+1)_(i) such that each neuron vector has a similar structure to the structures of the N_(i) input vectors x^(t)_(n)(S_(i)) classified into the neuron vector and its neighborhood].

[0069] Furthermore, this is a recording medium on which is recorded a program for performing the processing of the following equations (15), (16) and (17). $\begin{matrix}{W_{ij}^{0} = {x_{ave} + {5\sigma_{1}\left\{ {{b_{1}\left( \frac{i - {I/2}}{I} \right)} + {b_{2}\left( \frac{j - {J/2}}{J} \right)}} \right\}}}} & (15)\end{matrix}$

[0070] [in which W⁰_(ij) represents the P (P=I×J) initial neuron vectors arranged in a two-dimensional (i,j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J), x_(ave) is the average value of the K (K is a positive integer of 3 or above) input vectors {x₁, x₂, . . . , x_(K)} of M dimensions (M is a positive integer), b₁ and b₂ are the first principal component vector and the second principal component vector, respectively, obtained by performing principal component analysis on the input vectors {x₁, x₂, . . . , x_(K)}, and σ₁ is the standard deviation of the first principal component of the input vectors.] $\begin{matrix}{W_{ij}^{t + 1} = {W_{ij}^{t} + {{\alpha (t)}\left( {\frac{\sum\limits_{x_{k} \in S_{ij}}x_{k}}{N_{ij}} - W_{ij}^{t}} \right)}}} & (16) \\{{\alpha (t)} = {\max \left\{ {0.01,{0.6\left( {1 - \frac{t}{T}} \right)}} \right\}}} & (17)\end{matrix}$

[0071] [Here, W^(t)_(ij) represents the P (P=I×J) neuron vectors (t is the number of learning cycles, t=1, 2, . . . , T) arranged in a two-dimensional (i,j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J), and the above equation (16) is an equation to update W^(t)_(ij) to W^(t+1)_(ij) such that each neuron vector has a similar structure to the structures of the input vectors classified into the neuron vector and the N_(ij) input vectors x^(t)_(n)(S_(ij)) classified into the neighborhood of the neuron vector. The term α(t) denotes a learning coefficient (0<α(t)<1) for epoch t when the number of learning cycles is set to T epochs, and is expressed using a monotone decreasing function.]

[0072] The recording medium on which the abovementioned program is recorded is a recording medium selected from floppy disk, hard disk, magnetic tape, CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM and DVD-RW.

[0073] Furthermore, the present embodiment is a computer based system using the abovementioned computer readable recording medium.

BRIEF DESCRIPTION OF THE DRAWINGS

[0074] FIG. 1 is a diagram showing a flow chart of an algorithm of the self-organization method of the present invention.

[0075] FIG. 2 is a drawing showing the results of creating a SOM in which each gene of sixteen kinds of microorganism is classified, by performing principal component analysis using input vectors based on the codon usage frequencies of the 29596 genes of the sixteen kinds of microorganism to create the initial neuron vectors, and by updating the neuron vectors using a method of the present invention. Class numbers of the organisms shown in Table 1 are displayed for neurons into which genes of only one species are classified.

[0076] FIGS. 3A and 3B are drawings showing the results of creating a SOM wherein initial neuron vectors whose initial values use random numbers are created using input vectors based on the codon usage frequencies of the 29596 genes of the sixteen kinds of microorganism, and the genes of each of the sixteen kinds of microorganism are classified. The results of two independent analyses are shown in FIGS. 3A and 3B.

[0077] FIG. 4 shows the relationship between the number of learning cycles and the learning evaluation value when creating a SOM.

[0078] Numeral (1) shows the relationship between the number of learning cycles and the learning evaluation value when creating a SOM in which each gene of the sixteen kinds of microorganism is classified, by performing principal component analysis using input vectors based on the codon usage frequencies of the 29596 genes to create the initial neuron vectors, and by updating the neuron vectors by a method of the present invention.

[0079] Numeral (2) shows the relationship between the number of learning cycles and the learning evaluation value when random numbers are used for the initial values instead of performing the principal component analysis in (1).

[0080] FIGS. 5A and 5B show the results of creating a SOM wherein each gene of the sixteen kinds of microorganism is classified, by performing principal component analysis using input vectors based on the codon usage frequencies of the 29596 genes to create the initial neuron vectors, and by updating the neuron vectors by the conventional method.

[0081] FIG. 6 is a drawing of a SOM created by the method of the present invention using expression level data of 5544 kinds of genes in 60 cancer cell lines. The numbers in the figure denote the numbers of the classified genes.

[0082] FIGS. 7A, 7B, and 7C are drawings showing vector values of neuron vectors of each strain of cancer cell line in a SOM created by the method of the present invention using expression level data of 5544 kinds of genes in 60 cancer cell lines. FIG. 7A represents the vector value of the neuron vector at the position [16, 29] in the SOM, and FIGS. 7B and 7C represent the vector values of the genes classified at the position [16, 29].

DETAILED DESCRIPTION OF THE INVENTION

[0083] As follows is a detailed description of the present invention.

[0084] The present invention provides a high accuracy classification method and system, using a computer, by a nonlinear mapping method having the following six steps:

[0085] (Step 1) inputting input vector data to a computer,

[0086] (Step 2) setting initial neuron vectors by a computer,

[0087] (Step 3) classifying an input vector into one of the neuron vectors by a computer,

[0088] (Step 4) updating neuron vectors so as to have a similar structure to the structures of the input vectors classified into the neuron vector and the input vectors classified into the neighborhood of the neuron vector,

[0089] (Step 5) repeating step 3 and step 4 until a preset number of learning cycles is reached, and

[0090] (Step 6) classifying each input vector into one of the neuron vectors and outputting the result by a computer.

[0091] The above steps are shown as a flow chart in FIG. 1.

[0092] As follows is a detailed description of each step.

[0093] (Step 1)

[0094] Input vector data are input to a computer.

[0095] For input vector data, any input vector data that are based on the data to be analyzed can be used.

[0096] Any data that are useful to industry may be used as the data to be analyzed.

[0097] To be specific, biological data such as nucleotide sequences, amino acid sequences, results of DNA chip analyses and the like; data such as image data, audio data and the like obtained by various measuring instruments; and data such as diagnostic results, questionnaire results and the like can be included.

[0098] There are normally K (K is a positive integer of 3 or above) input vectors {x₁, x₂, . . . , x_(K)} of M dimensions (M is a positive integer), and each input vector x_(k) can be represented by the following equation (18).

x_(k)={x_(k1), x_(k2), . . . , x_(kM)}  (18)

[0099] For k in the equation (18), k=1, 2, . . . , K.

[0100] The input vectors are set based on the data to be analyzed. Normally, the input vectors are set according to the usual methods described in “Application of Self-Organizing Maps—two dimensional visualization of multidimensional information” (Authors: Heizo Tokutaka, Satoru Kishida, Kikuo Fujimura; Kaibundo Publishing Company; first published on Jul. 20th, 1999; ISBN 4-303-73230-3); “Self-Organizing Maps” (Author: T. Kohonen, translated by Heizo Tokutaka, Satoru Kishida, Kikuo Fujimura; Springer-Verlag Tokyo Co., Ltd.; published on Jun. 15th, 1996; ISBN 4-431-70700-X C3055), and the like.

[0101] An example of this setting follows.

[0102] 1) Classification of microorganism genes

[0103] In a case where K kinds of gene originating from a plurality of microorganisms are classified, the nucleotide sequence information of these genes is converted such that it can be expressed numerically in M dimensions (64 dimensions in a case where codon usage frequency is used), based on codon usage frequency. Data of M dimensions converted numerically in this manner are used as input vectors.

[0104] 2) Classification of human genes by expression characteristics

[0105] In a case where K kinds of genes originating from humans are classified by expression patterns in M kinds of cell lines with different characteristics, the expression levels of these genes in the M kinds of cell lines are used as numerical values, and data of M dimensions consisting of these numerical values are set as input vectors.

[0106] Step 1 is a step in which input vector data based on the information data to be analyzed are input to a computer, and this input can be performed by normal methods, such as manual input, voice input, paper input and the like.

[0107] (Step 2)

[0108] Initial neuron vectors are set using a computer.

[0109] The initial neuron vectors can be set based on random numbers, similarly to the conventional method. For the random numbers, random numbers and the like generated on the computer using the C language standard function rand() can be used.

[0110] In a case where it is desired that the structure of the input vectors is reflected in the SOM accurately, or that the learning time is shortened, it is preferable to set the initial neuron vectors based on the data of the K input vectors {x₁, x₂, . . . , x_(K)} of M dimensions set in the above step 1, using a multivariate analysis technique such as principal component analysis, multidimensional scaling and the like, rather than setting the initial neuron vectors based on random numbers.

[0111] In a case where the initial neuron vectors set in this manner consist of a set of P neuron vectors {W⁰_(1), W⁰_(2), . . . , W⁰_(P)} arranged in a lattice of D dimensions (D is a positive integer), each neuron vector can be represented by the following equation (19).

W⁰_(i)=F{x₁, x₂, . . . , x_(K)}  (19)

[0112] In the equation (19), i=1, 2, . . . , P. Furthermore, F{x₁, x₂, . . . , x_(K)} in the equation (19) represents a function for converting from the input vectors {x₁, x₂, . . . , x_(K)} to the initial neuron vectors.

[0113] As a specific example, a method of setting the initial neuron vectors in a two-dimensional (D=2) or three-dimensional (D=3) lattice will be described. In accordance with this method, it is possible to set the initial neuron vectors in a lattice of D dimensions.

[0114] (1) Method of setting initial neuron vectors in a two dimensionallattice (D=2)

[0115] Principal component analysis is performed on the K input vectors {x₁, x₂, . . . , x_(K)} of M dimensions to obtain a first principal component vector and a second principal component vector, and the obtained principal component vectors are designated b₁ and b₂, respectively.

[0116] Based on these two principal component vectors, the principal components Z_(1k)=b₁x_(k) and Z_(2k)=b₂x_(k) of the K input vectors are obtained (k=1, 2, . . . , K). The standard deviations of {Z₁₁, Z₁₂, . . . , Z_(1k), . . . , Z_(1K)} and {Z₂₁, Z₂₂, . . . , Z_(2k), . . . , Z_(2K)} are designated σ₁ and σ₂ respectively.

[0117] The average value of the input vectors is obtained, and theaverage value obtained is designated x_(ave).

[0118] Two-dimensional lattice points are represented by ij (i=1, 2, . . . , I, j=1, 2, . . . , J), and neuron vectors W⁰_(ij) are placed at the two-dimensional lattice points (ij). The values of I and J may be integers of 3 or above. Preferably, J is the largest integer less than I×σ₂/σ₁. The value of I may be set appropriately depending on the number of input vector data; in general a value of 50 to 1000 is used, and typically a value of 100 is used.

[0119] W⁰_(ij) can be defined by equation (20). $\begin{matrix}{W_{ij}^{0} = {x_{ave} + {5\sigma_{1}\left\{ {{b_{1}\left( \frac{i - {I/2}}{I} \right)} + {b_{2}\left( \frac{j - {J/2}}{J} \right)}} \right\}}}} & (20)\end{matrix}$

[0120] (2) Method of setting initial neuron vectors in athree-dimensional lattice (D=3)

[0121] In the principal component analysis in (1) described above, a third principal component vector is obtained in addition to the first principal component vector and the second principal component vector, and the obtained first, second and third principal component vectors are designated b₁, b₂ and b₃ respectively.

[0122] Based on these three principal component vectors, the principal components Z_(1k)=b₁x_(k), Z_(2k)=b₂x_(k), and Z_(3k)=b₃x_(k) are obtained. The standard deviations of {Z₁₁, Z₁₂, . . . , Z_(1k), . . . , Z_(1K)}, {Z₂₁, Z₂₂, . . . , Z_(2k), . . . , Z_(2K)} and {Z₃₁, Z₃₂, . . . , Z_(3k), . . . , Z_(3K)} are designated σ₁, σ₂, and σ₃ respectively. Three-dimensional lattice points are represented by ijl (i=1, 2, . . . , I, j=1, 2, . . . , J, l=1, 2, . . . , L), and neuron vectors W⁰_(ijl) are placed at the three-dimensional lattice points (ijl). The values of I, J and L may be integers of 3 or above. Preferably, J and L are the largest integers less than I×σ₂/σ₁ and I×σ₃/σ₁ respectively. The value of I may be set appropriately depending on the number of input vector data; in general a value of 50 to 1000 is used, and typically a value of 100 is used.

[0123] W⁰_(ijl) can be defined by equation (21). $\begin{matrix}{W_{ijl}^{0} = {x_{ave} + {5\sigma_{1}\left\{ {{b_{1}\left( \frac{i - {I/2}}{I} \right)} + {b_{2}\left( \frac{j - {J/2}}{J} \right)} + {b_{3}\left( \frac{l - {L/2}}{L} \right)}} \right\}}}} & (21)\end{matrix}$
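A minimal NumPy sketch of this three-dimensional initialization follows, assuming SVD-based principal component analysis and writing equation (21) with the third principal component term b₃((l−L/2)/L) included; the function name and the use of int() for “largest integer” truncation are illustrative assumptions.

    import numpy as np

    def init_neurons_3d(X, I=100):
        """Initial neuron vectors on a three-dimensional (D=3) lattice.

        X: (K, M) input vectors.  Returns an (I, J, L, M) array per equation (21).
        """
        x_ave = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - x_ave, full_matrices=False)
        b1, b2, b3 = Vt[0], Vt[1], Vt[2]   # first three principal component vectors
        z = (X - x_ave) @ Vt[:3].T
        s1, s2, s3 = z.std(axis=0)         # standard deviations of the components
        J = int(I * s2 / s1)               # largest integer not exceeding I*s2/s1
        L = int(I * s3 / s1)
        i = np.arange(1, I + 1)[:, None, None, None]
        j = np.arange(1, J + 1)[None, :, None, None]
        l = np.arange(1, L + 1)[None, None, :, None]
        return x_ave + 5 * s1 * (b1 * (i - I / 2) / I
                                 + b2 * (j - J / 2) / J
                                 + b3 * (l - L / 2) / L)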

[0124] (Step 3)

[0125] All of the input vectors {x₁, x₂, . . . , x_(K)} are classified into neuron vectors.

[0126] To be specific, after t learning cycles, all of the input vectors {x₁, x₂, . . . , x_(K)} are each classified into one of the P neuron vectors W^(t)_(1), W^(t)_(2), . . . , W^(t)_(P), using similarity scaling (distance, inner product, direction cosine or the like), by a computer.

[0127] Here, t is the number of the learning cycle (epoch). In the case of T learning cycles, t=0, 1, 2, . . . , T. The i-th neuron vector at the t-th epoch can be represented by W^(t)_(i). Here, i=1, 2, . . . , P.

[0128] The neuron vectors at t=0 correspond to the initial neuron vectors set in step 2.

[0129] Classification of each input vector x_(k) can be performed by calculating the Euclidean distance to each neuron vector W^(t)_(i), and classifying the input vector into the neuron vector having the smallest Euclidean distance. Here, in the case of a neuron vector located at a two-dimensional lattice point (ij), W^(t)_(i) can be represented by W^(t)_(ij).

[0130] The input vectors {x₁, x₂, . . . , x_(K)} may be classified into W^(t)_(i) by parallel processing for each input vector x_(k).
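A sketch of this classification step follows; the helper name is illustrative. Since the minimization for each x_(k) is independent of every other input vector, the loop body is exactly the unit of work that can be farmed out to separate processors (minimizing the squared Euclidean distance gives the same winner as minimizing the distance itself).

    import numpy as np

    def classify(X, W):
        """Step 3 sketch: assign each input vector to its nearest neuron vector.

        X: (K, M) input vectors; W: (I, J, M) neuron vectors.
        Returns a (K, 2) array of winning lattice indices (i, j).
        """
        I, J, M = W.shape
        flatW = W.reshape(I * J, M)
        idx = np.empty(len(X), dtype=np.int64)
        for k, x in enumerate(X):      # each iteration is independent: parallelizable
            idx[k] = np.square(flatW - x).sum(axis=1).argmin()
        return np.stack(np.unravel_index(idx, (I, J)), axis=1)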

[0131] (Step 4)

[0132] For each neuron vector W^(t)_(i), the neuron vector W^(t)_(i) is updated so as to have a similar structure to the structures of the input vectors (x_(k)) classified into the neuron vector and the input vectors classified into the neighborhood of the neuron vector.

[0133] That is to say, the set of input vectors belonging to the lattice point at which a specific neuron vector W^(t)_(i) is positioned and to its neighboring lattice points is designated S_(i). The neuron vectors W^(t)_(i) (i=1, 2, . . . , P) are updated by obtaining new neuron vectors W^(t+1)_(i) that reflect the structure of the input vectors belonging to S_(i), from the N_(i) vectors x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(N_(i))(S_(i)) belonging to S_(i) and W^(t)_(i), using the function G in the following equation (22).

W^(t+1)_(i)=G(W^(t)_(i), x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(N_(i))(S_(i)))   (22)

[0134] As a specific example, the updating of a neuron vector W^(t)_(ij) set on a two-dimensional lattice will be described. A neuron vector set on a lattice of D dimensions may be updated in the same manner.

[0135] When an input vector x_(k) belongs to a neuron vector W^(t)_(ij) arranged in a two-dimensional lattice, and the set of input vectors belonging to the lattice point at which W^(t)_(ij) is positioned and its neighboring lattice points is designated S_(ij), it is possible to update the neuron vector W^(t)_(ij) by obtaining a new neuron vector W^(t+1)_(ij) that reflects the structure of the input vectors belonging to S_(ij), from the N_(ij) input vectors x^(t)_(1)(S_(ij)), x^(t)_(2)(S_(ij)), . . . , x^(t)_(N_(ij))(S_(ij)) belonging to S_(ij) and W^(t)_(ij), by the following equation (23). $\begin{matrix}{W_{ij}^{t + 1} = {W_{ij}^{t} + {{\alpha (t)}\left( {\frac{\sum\limits_{x_{k} \in S_{ij}}x_{k}}{N_{ij}} - W_{ij}^{t}} \right)}}} & (23)\end{matrix}$

[0136] Here, N_(ij) is the total number of input vectors classified into S_(ij).

[0137] The term α(t) designates a learning coefficient (0<α(t)<1) for epoch t when the number of learning cycles is set to T epochs, and uses a monotone decreasing function. Preferably, it can be obtained by the following equation (24).

[0138] The number of learning cycles T may be set appropriately depending on the number of input vector data. In general it is set between 10 epochs and 1000 epochs, and typically 100 epochs. $\begin{matrix}{{\alpha (t)} = {\max \left\{ {0.01,{0.6\left( {1 - \frac{t}{T}} \right)}} \right\}}} & (24)\end{matrix}$

[0139] The neighboring set S_(ij) is the set of input vectors x_(k) classified as lattice points i′j′ which satisfy the conditions i−β(t)≦i′≦i+β(t) and j−β(t)≦j′≦j+β(t). The symbol β(t) represents the number that determines the neighborhood, and is obtained by equation (25).

β(t)=max{0, 25−t}  (25)

[0140] It is possible to update the neuron vectors {W^(t)_(1), W^(t)_(2), . . . , W^(t)_(P)} by parallel processing for each neuron vector W^(t)_(i).
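The update of a single neuron vector by equations (23) to (25) can be sketched as below; the function signature is illustrative. Because W^(t+1)_(ij) depends only on W^(t)_(ij) and on the fixed classification result of epoch t, every lattice point can indeed be updated in parallel.

    import numpy as np

    def update_neuron(W, X, assign, ii, jj, t, T):
        """Step 4 sketch: update one neuron vector W_ij per equations (23)-(25).

        assign: (K, 2) winning lattice indices from the classification step.
        """
        alpha = max(0.01, 0.6 * (1 - t / T))   # equation (24)
        beta = max(0, 25 - t)                  # equation (25)
        # S_ij: input vectors classified into lattice points i'j' with
        # |i' - i| <= beta and |j' - j| <= beta.
        in_S = (np.abs(assign[:, 0] - ii) <= beta) & (np.abs(assign[:, 1] - jj) <= beta)
        if not in_S.any():
            return W[ii, jj]                   # no input vectors in the neighborhood
        # Equation (23): move W_ij toward the mean of the vectors in S_ij.
        return W[ii, jj] + alpha * (X[in_S].mean(axis=0) - W[ii, jj])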

[0141] (Step 5)

[0142] Learning is performed by repeating step 3 and step 4 until the preset number of epochs T is reached.

[0143] (Step 6)

[0144] After learning is completed, the input vectors x_(k) are classified into the neuron vectors W^(T)_(i) by a computer, in the same manner as in step 3, and the results are output.

[0145] Based on the classification criterion represented by W^(T)_(i), in which the structure of the input vectors is reflected, the input vectors x_(k) are classified. That is, in a case where a plurality of input vectors are classified into the same neuron vector, it is clear that the vector structures of these input vectors are very similar.

[0146] It is possible to classify the input vectors {x₁, x₂, . . . ,x_(K)} by parallel processing for each input vector x_(k).

[0147] The output of the classification result by the above-described steps may be visualized by displaying it as a SOM.

[0148] Creation and display of a SOM may be performed according to the methods described in “Application of Self-Organizing Maps—two dimensional visualization of multidimensional information” (Authors: Heizo Tokutaka, Satoru Kishida, Kikuo Fujimura; Kaibundo Publishing Company; first published on Jul. 20th, 1999; ISBN 4-303-73230-3); “Self-Organizing Maps” (Author: T. Kohonen, translated by Heizo Tokutaka, Satoru Kishida, Kikuo Fujimura; Springer-Verlag Tokyo Co., Ltd.; published on Jun. 15th, 1996; ISBN 4-431-70700-X C3055), and the like.

[0149] For example, the classification results of input vectors obtained by placing neuron vectors at two-dimensional lattice points can be displayed as a two-dimensional SOM using the “Excel” spreadsheet software from Microsoft Corporation or the like. To be specific, after applying a suitable label to each lattice point based on the characteristics of the input vectors belonging to that lattice point, these label values are exported to Excel, and using the functions of Excel, the labels can be displayed as a SOM in a two-dimensional lattice on a monitor, in print, or the like. It is also possible to export to Excel the total number of input vectors belonging to each lattice point, and to display these totals as a SOM in a two-dimensional lattice using the functions of Excel.

[0150] For the computer used in the above-described steps, anything can be used as long as it has the functions of a computer; however, it is preferable to use one with a fast calculation speed. Specific examples include the SUN Ultra 60 workstation manufactured by Sun Microsystems Inc. and the like. The above steps 1 to 6 do not need to be performed using the same computer. That is, it is possible to output a result obtained in one of the above steps to another computer, and to process the succeeding step on the other computer.

[0151] Furthermore, it is also possible to perform the computational processing of the steps for which parallel processing is possible (steps 3, 4, 5 and 6) in parallel, using a computer with multiple CPUs or a plurality of computers. In the conventional method, a sequential learning algorithm is used, so parallel processing is not possible. However, in the present invention, a batch-learning algorithm is used, so parallel processing is possible.

[0152] Since parallel processing is possible, the computing time required to classify input vectors can be shortened considerably.

[0153] That is, if the times to process the above six steps using one processor are T1, T2, T3, T4, T5 and T6 respectively, and parallel processing is performed by C processors, ideally the times required by each step become T1, T2, T3/C, T4/C, T5/C and T6 respectively, and the total computing time can be shortened by

T1 + T2 + T3 + T4 + T5 + T6 − {T1 + T2 + (T3 + T4 + T5)/C + T6} = (1 − 1/C)(T3 + T4 + T5)
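As an illustration with assumed figures: with C=4 processors and times T1=T2=T6=1 minute and T3+T4+T5=60 minutes, the total computing time falls from 1+1+60+1=63 minutes to 1+1+60/4+1=18 minutes, a saving of (1 − 1/C)(T3 + T4 + T5)=(1 − 1/4)×60=45 minutes.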

[0154] The steps 2 to 6 can be automated by using a computer readable recording medium on which a program for performing the procedure from steps 2 to 6 is recorded. The recording medium is a recording medium of the present invention.

[0155] “Computer readable recording medium” means any recording medium that a computer can read and access directly. Such a recording medium may be a magnetic storage medium such as floppy disk, hard disk, magnetic tape and the like, an optical storage medium such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW and the like, an electric storage medium such as RAM, ROM and the like, or a hybrid of these categories (for example, a magnetic/optical storage medium such as MO). However, it is not limited to these.

[0156] The computer based system, wherein the above-described computer readable recording medium of the present invention is used, is a system of the present invention.

[0157] “Computer based system” means one comprising a hardware device, a software device and a data storage device, which are used for analyzing information stored on a computer readable recording medium of the present invention.

[0158] The hardware device basically comprises an input device, a data storage device, a central processing unit and an output device.

[0159] Software device means a device which uses a program for a computer to perform the procedures from steps 2 to 6, stored on the recording medium of the present invention.

[0160] Data storage device means memory to store input information and calculation results, and a memory access device that can access it.

[0161] That is to say, a computer based system of the present invention is characterized in that there is provided:

[0162] (i) an input device for inputting input vector data;

[0163] (ii) a software device for processing the input data using a program for the computer to perform steps 2 to 6; and

[0164] (iii) an output device for outputting the classification results obtained by the software device in (ii).

[0165] As follows are examples of the present invention.

EXAMPLES

Example 1

[0166] For each of the 29596 genes of the 16 kinds of microorganism described in Table 1, principal component analysis of input vectors based on the codon usage frequency of each gene is performed, using a SUN Ultra 60 workstation manufactured by Sun Microsystems Inc., to create the initial neuron vectors, and a SOM is created. The DNA sequence data for each gene were obtained from ftp://ncbi.nlm.nih.gov/genbank/genomes/bacteria/.

[0167] As follows is a detailed description.

TABLE 1 Training set used for development of neuron vectors

Name of Organism                       Abbreviation  Number of Genes  Class number  ID Numbers
Archaeoglobus fulgidus                 AFU           2088             1             1-2088
Aquifex aeolicus                       AAE           1489             2             2089-3577
Borrelia burgdorferi                   BBU           772              3             3578-4349
Bacillus subtilis                      BSU           3788             4             4350-8137
Chlamydia trachomatis                  CTR           833              5             8138-8970
Escherichia coli                       ECO           3913             6             8971-12883
Helicobacter pylori                    HPY           1392             7             12884-14275
Haemophilus influenzae                 HIN           1572             8             14276-15847
Methanococcus jannashii                MJA           1522             9             15848-17369
Methanobacterium thermoautotrophicum   MTH           1646             10            17370-19015
Mycobacterium tuberculosis             MTU           3675             11            19016-22690
Mycoplasma genitalium                  MGE           450              12            22691-23140
Mycoplasma pneumoniae                  MPN           657              13            23141-23797
Pyrococcus horikoshii                  PHO           1973             14            23798-25770
Synechocystis sp.                      SYN           2909             15            25771-28679
Treponema pallidum                     TPA           917              16            28680-29596

[0168] (a) Calculation of Input Vectors and Setting of Initial Neuron Vectors

[0169] For the genes of each microorganism in Table 1, each gene is given an ID number as shown in Table 1, such that the Archaeoglobus fulgidus AF ODU1 gene is the first gene and the Treponema pallidum TP1041 gene is the 29596th gene.

[0170] For all of the genes, the frequency of each of the 64 kinds of codon in the translated region, from the translation start codon to the termination codon, is obtained according to the codon number table in Table 2. The vector comprising the codon frequencies C_(km) (m=1, 2, . . . , 64) of gene k is designated C_(k)=(C_(k1), C_(k2), . . . , C_(k64)).

[0171] To be specific, for the Escherichia coli thrA gene (the 8971st gene) in Table 1, for example, the count of the 1st codon (Phe) is 11, the 2nd codon (Phe) is 19, the 3rd codon (Leu) is 10, the 4th codon (Leu) is 13, the 5th codon (Ser) is 11, the 6th codon (Ser) is 10, the 7th codon (Ser) is 6, the 8th codon (Ser) is 9, the 9th codon (Tyr) is 12, the 10th codon (Tyr) is 8, the 11th codon (Ter) is 0, the 12th codon (Ter) is 0, the 13th codon (Cys) is 3, the 14th codon (Cys) is 9, the 15th codon (Ter) is 1, the 16th codon (Trp) is 4, the 17th codon (Leu) is 8, the 18th codon (Leu) is 13, the 19th codon (Leu) is 2, the 20th codon (Leu) is 43, the 21st codon (Pro) is 3, the 22nd codon (Pro) is 6, the 23rd codon (Pro) is 2, the 24th codon (Pro) is 18, the 25th codon (His) is 8, the 26th codon (His) is 6, the 27th codon (Gln) is 11, the 28th codon (Gln) is 19, the 29th codon (Arg) is 18, the 30th codon (Arg) is 19, the 31st codon (Arg) is 3, the 32nd codon (Arg) is 4, the 33rd codon (Ile) is 30, the 34th codon (Ile) is 15, the 35th codon (Ile) is 1, the 36th codon (Met) is 23, the 37th codon (Thr) is 5, the 38th codon (Thr) is 19, the 39th codon (Thr) is 2, the 40th codon (Thr) is 8, the 41st codon (Asn) is 22, the 42nd codon (Asn) is 16, the 43rd codon (Lys) is 22, the 44th codon (Lys) is 12, the 45th codon (Ser) is 3, the 46th codon (Ser) is 12, the 47th codon (Arg) is 0, the 48th codon (Arg) is 2, the 49th codon (Val) is 19, the 50th codon (Val) is 18, the 51st codon (Val) is 5, the 52nd codon (Val) is 27, the 53rd codon (Ala) is 15, the 54th codon (Ala) is 36, the 55th codon (Ala) is 14, the 56th codon (Ala) is 26, the 57th codon (Asp) is 30, the 58th codon (Asp) is 14, the 59th codon (Glu) is 40, the 60th codon (Glu) is 13, the 61st codon (Gly) is 22, the 62nd codon (Gly) is 22, the 63rd codon (Gly) is 9, and the 64th codon (Gly) is 10, so the codon usage frequency vector of the gene can be expressed as C₈₉₇₁=(11, 19, 10, 13, 11, 10, 6, 9, 12, 8, 0, 0, 3, 9, 1, 4, 8, 13, 2, 43, 3, 6, 2, 18, 8, 6, 11, 19, 18, 19, 3, 4, 30, 15, 1, 23, 5, 19, 2, 8, 22, 16, 22, 12, 3, 12, 0, 2, 19, 18, 5, 27, 15, 36, 14, 26, 30, 14, 40, 13, 22, 22, 9, 10).

TABLE 2 Codon number table

First                    Second Letter                     Third
Letter    T           C           A            G           Letter
T          1 (Phe)     5 (Ser)     9 (Tyr)     13 (Cys)    T
           2 (Phe)     6 (Ser)    10 (Tyr)     14 (Cys)    C
           3 (Leu)     7 (Ser)    11 (Ter*)    15 (Ter*)   A
           4 (Leu)     8 (Ser)    12 (Ter*)    16 (Trp)    G
C         17 (Leu)    21 (Pro)    25 (His)     29 (Arg)    T
          18 (Leu)    22 (Pro)    26 (His)     30 (Arg)    C
          19 (Leu)    23 (Pro)    27 (Gln)     31 (Arg)    A
          20 (Leu)    24 (Pro)    28 (Gln)     32 (Arg)    G
A         33 (Ile)    37 (Thr)    41 (Asn)     45 (Ser)    T
          34 (Ile)    38 (Thr)    42 (Asn)     46 (Ser)    C
          35 (Ile)    39 (Thr)    43 (Lys)     47 (Arg)    A
          36 (Met)    40 (Thr)    44 (Lys)     48 (Arg)    G
G         49 (Val)    53 (Ala)    57 (Asp)     61 (Gly)    T
          50 (Val)    54 (Ala)    58 (Asp)     62 (Gly)    C
          51 (Val)    55 (Ala)    59 (Glu)     63 (Gly)    A
          52 (Val)    56 (Ala)    60 (Glu)     64 (Gly)    G

[0172] When the codon usage frequency vector of gene ID k determined by the above method is C_(k), the input vector x_(k)={x_(k1), x_(k2), . . . , x_(kM)} (M=64) of gene ID k can be calculated by the following equation (26). For m here, m=1, 2, . . . , 64. $\begin{matrix}{x_{km} = \frac{C_{km}}{\sum\limits_{n = 1}^{M}C_{kn}}} & (26)\end{matrix}$
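A sketch of equation (26) in NumPy, using the thrA counts quoted below as a check; the function name is illustrative.

    import numpy as np

    def codon_input_vector(counts):
        """Equation (26): normalize a 64-dimensional codon count vector C_k
        into the input vector x_k (the frequencies sum to 1)."""
        counts = np.asarray(counts, dtype=float)
        return counts / counts.sum()

    # The thrA example from the text: C_8971 has 821 codons in total,
    # so x_8971[0] = 11/821 = 0.0134, matching the values quoted below.
    C_8971 = np.array([11, 19, 10, 13, 11, 10, 6, 9, 12, 8, 0, 0, 3, 9, 1, 4,
                       8, 13, 2, 43, 3, 6, 2, 18, 8, 6, 11, 19, 18, 19, 3, 4,
                       30, 15, 1, 23, 5, 19, 2, 8, 22, 16, 22, 12, 3, 12, 0, 2,
                       19, 18, 5, 27, 15, 36, 14, 26, 30, 14, 40, 13, 22, 22, 9, 10])
    x_8971 = codon_input_vector(C_8971)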

[0173] To be specific, for the Escherichia coli thrA gene with the codon usage frequency vector C₈₉₇₁=(11, 19, 10, 13, 11, 10, 6, 9, 12, 8, 0, 0, 3, 9, 1, 4, 8, 13, 2, 43, 3, 6, 2, 18, 8, 6, 11, 19, 18, 19, 3, 4, 30, 15, 1, 23, 5, 19, 2, 8, 22, 16, 22, 12, 3, 12, 0, 2, 19, 18, 5, 27, 15, 36, 14, 26, 30, 14, 40, 13, 22, 22, 9, 10), the input vector becomes x₈₉₇₁=(0.0134, 0.0231, 0.0122, 0.0158, 0.0134, 0.0122, 0.0073, 0.0110, 0.0146, 0.0097, 0.0000, 0.0000, 0.0037, 0.0110, 0.0012, 0.0049, 0.0097, 0.0158, 0.0024, 0.0524, 0.0037, 0.0073, 0.0024, 0.0219, 0.0097, 0.0073, 0.0134, 0.0231, 0.0219, 0.0231, 0.0037, 0.0049, 0.0365, 0.0183, 0.0012, 0.0280, 0.0061, 0.0231, 0.0024, 0.0097, 0.0268, 0.0195, 0.0268, 0.0146, 0.0037, 0.0146, 0.0000, 0.0024, 0.0231, 0.0219, 0.0061, 0.0329, 0.0183, 0.0438, 0.0171, 0.0317, 0.0365, 0.0171, 0.0487, 0.0158, 0.0268, 0.0268, 0.0110, 0.0122).

[0174] The first principal component vector b₁ and the second principal component vector b₂ are obtained by performing principal component analysis on the input vectors of all of the 29596 kinds of genes, created by the above method. The results are as follows.

[0175] b₁=(−0.1876, 0.0710, −0.2563, −0.0100, −0.0778, 0.0400, −0.0523, 0.0797, −0.1245, 0.0254, −0.0121, 0.0013, −0.0203, 0.0228, 0.0025, 0.0460, −0.0727, 0.0740, −0.0353, 0.2936, −0.0470, 0.0686, −0.0418, 0.1570, −0.0229, 0.0582, −0.0863, 0.1070, 0.0320, 0.1442, 0.0208, 0.1180, −0.2116, 0.1070, −0.1550, 0.0048, −0.0676, 0.1536, −0.0664, 0.0607, −0.1962, 0.0128, −0.3789, −0.0720, −0.0492, 0.0331, −0.0975, −0.0176, −0.1183, 0.1423, −0.0577, 0.1757, −0.0690, 0.2778, −0.0260, 0.2361, −0.1318, 0.1606, −0.2148, 0.0412, 0.0260, 0.2081, −0.0757, 0.0506)

[0176] b₂=(0.1525, 0.1048, −0.1891, −0.1399, −0.0539, 0.0112, 0.0281, −0.0246, −0.0922, 0.1455, −0.0059, 0.0001, −0.0215, 0.0134, 0.0062, −0.0386, 0.0938, 0.1603, −0.0026, −0.0785, −0.0104, 0.0285, 0.0181, −0.0550, −0.0719, 0.0243, −0.2403, −0.0425, −0.1203, −0.1199, −0.0351, −0.0518, −0.1903, −0.0411, 0.3417, 0.0179, −0.0391, −0.0644, 0.0178, −0.0177, −0.1515, 0.0320, −0.1318, 0.3510, −0.0449, −0.0138, 0.1530, 0.3162, 0.1676, 0.0160, 0.0342, −0.0725, −0.0221, −0.0656, 0.0271, −0.1325, −0.0695, 0.0942, −0.0141, 0.4003, −0.0403, −0.1298, 0.1701, 0.0146)

[0177] Here, the standard deviations σ₁ and σ₂ of the first and second principal components of the input vectors are 0.05515 and 0.03757 respectively, and the average x_(ave) of the input vectors is x_(ave)=(0.0266, 0.0167, 0.0217, 0.0169, 0.0105, 0.0096, 0.0098, 0.0071, 0.0170, 0.0148, 0.0020, 0.0008, 0.0051, 0.0058, 0.0015, 0.0114, 0.0175, 0.0151, 0.0078, 0.0242, 0.0098, 0.0104, 0.0096, 0.0125, 0.0105, 0.0094, 0.0170, 0.0168, 0.0091, 0.0115, 0.0038, 0.0073, 0.0309, 0.0215, 0.0170, 0.0235, 0.0106, 0.0169, 0.0119, 0.0108, 0.0205, 0.0182, 0.0378, 0.0235, 0.0093, 0.0129, 0.0104, 0.0108, 0.0228, 0.0148, 0.0125, 0.0230, 0.0189, 0.0237, 0.0190, 0.0207, 0.0296, 0.0202, 0.0403, 0.0278, 0.0178, 0.0210, 0.0177, 0.0139)

[0178] Next, two-dimensional lattice points are represented as ij (i=1, 2, . . . , I, j=1, 2, . . . , J), and 64-dimensional neuron vectors W⁰_(ij)=(w⁰_(ij1), w⁰_(ij2), . . . , w⁰_(ij64)) are placed at the two-dimensional lattice points (ij). Here I is 100, and J is the largest integer less than I×σ₂/σ₁; in the present analysis, J turned out to be 68. W⁰_(ij) is defined by equation (27). $\begin{matrix}{W_{ij}^{0} = {x_{ave} + {5\sigma_{1}\left\{ {{b_{1}\left( \frac{i - {I/2}}{I} \right)} + {b_{2}\left( \frac{j - {J/2}}{J} \right)}} \right\}}}} & (27)\end{matrix}$

[0179] (b) Classification of Neuron Vectors

[0180] The input vectors x_(k) based on each of the 29596 kinds of genes are classified into the neuron vectors W^(t)_(ij) with the smallest Euclidean distances.

[0181] (c) Update of Neuron Vectors

[0182] Next, the neuron vectors W^(t)_(ij) are updated by the following equation (28). $\begin{matrix}{W_{ij}^{t + 1} = {W_{ij}^{t} + {{\alpha (t)}\left( {\frac{\sum\limits_{x_{k} \in S_{ij}}x_{k}}{N_{ij}} - W_{ij}^{t}} \right)}}} & (28)\end{matrix}$

[0183] The learning coefficient α(t) (0<α(t)<1) for the t-th epoch whenthe number of learning cycles is set to T epochs is obtained by anequation (29). The present experiment is performed with 100 learningcycles (100 epochs). $\begin{matrix}{{\alpha (t)} = {\max \left\{ {0.01,{0.6\left( {1 - \frac{t}{T}} \right)}} \right\}}} & (29)\end{matrix}$

[0184] The neighboring set S_(ij) is the set of input vectors x_(k) classified as lattice points i′j′ which satisfy the conditions i−β(t)≦i′≦i+β(t) and j−β(t)≦j′≦j+β(t). Furthermore, N_(ij) is the total number of input vectors classified into S_(ij). β(t) represents the number that determines the neighborhood, and is obtained by equation (30).

β(t)=max{0, 25−t}  (30)

[0185] (d) Learning Process

Subsequently, the above steps (b) and (c) are repeated 100 (=T) times.

[0186] Here, the learning is evaluated by the square error defined by the following equation (31). $\begin{matrix}{Q^{t} = {\sum\limits_{k = 1}^{N}\left\| {x_{k} - W_{ij(k)}^{t}} \right\|^{2}}} & (31)\end{matrix}$

[0187] Here, N (=29596) is the total number of genes, and W^(t)_(ij(k)) is the neuron vector with the smallest Euclidean distance from x_(k). The interpretation of this learning evaluation value Q is that the smaller the value, the better the information of the input vectors is reflected in the neuron vectors.
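A sketch of equation (31); the argument names mirror the classification sketch in step 3 above and are illustrative.

    import numpy as np

    def evaluation_Q(X, W, assign):
        """Equation (31): square-error learning evaluation value Q^t.

        assign: (K, 2) lattice indices of the winning neuron for each x_k.
        A smaller Q means the neuron vectors reflect the input vectors better.
        """
        winners = W[assign[:, 0], assign[:, 1]]   # W_{ij(k)} for each gene k
        return float(np.square(X - winners).sum())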

[0188] (e) Classification of Input Vectors into Neuron Vectors

[0189] The input vectors x_(k) based on each of the 29596 kinds of genes are classified into the neuron vectors W^(T)_(ij(k)) with the smallest Euclidean distance, obtained as a result of 100 epochs of learning.

[0190] The SOM obtained by the classification is shown in FIG. 2. In a case where only genes originating from one kind of microorganism are classified into a neuron, the class number (Table 1) of the microorganism from which the genes originated is displayed. For the SOM in FIG. 2, exactly the same map is obtained even when the analysis is repeated; that is, a reproducible map can be created.

[0191] For comparison, neuron vectors W⁰_(ij) were defined using random numbers, without performing principal component analysis on the input vectors, and the result is shown in Reference Example 1 described later. With this conventional method the results differ for each analysis, and the same, reproducible result could not be obtained (refer to FIG. 3A and FIG. 3B).

[0192] Furthermore, the relationship between the number of learning cycles (epochs) and the learning evaluation value Q is shown in FIG. 4. With the method of the present invention, in which principal component analysis is used for initial value setting, the input vector data are reflected in the neuron vectors in fewer learning cycles than when the initial values are set by random numbers as described in Reference Example 1; that is to say, a shortening of calculation time and an improvement in classification accuracy can be achieved.

[0193] Updating of neuron vectors was also performed by the sequential processing algorithm of the conventional method, and the results are shown in Reference Example 2 described later. In the conventional method, a different SOM was created depending on the input order of the input vectors x_(k), and the degree of grouping was very low.

[0194] As described above, it has been shown that the present invention achieves the same, reproducible analysis result (SOM) independent of the input order of the input vectors, in a short time and with high accuracy.

Reference Example 1

[0195] The classification analysis of the genes of each of the 16 kinds of microorganism was performed by the same method as in Example 1, except that the neuron vectors W⁰_(ij) were defined using random numbers, without performing principal component analysis on the input vectors in step (a) of the above-described Example 1.

[0196] The neuron vectors W⁰_(ij) were defined by the following method.

[0197] Random numbers were generated using the C language standard function rand( ) within the range between the minimum value min_(k=1,2, . . . ,29596)(x_(km)) and the maximum value max_(k=1,2, . . . ,29596)(x_(km)) for each m-th variable (m=1, 2, . . . , 64) of the input data x_(k)={x_(k1), x_(k2), . . . , x_(k64)} (k=1, 2, . . . , 29596) obtained in step (a) of the above-described Example 1. The neuron vectors W⁰_(ij) were defined by the following equation (32).

[0198] Here, W⁰_(ij)=(w⁰_(ij1), w⁰_(ij2), . . . , w⁰_(ijm), . . . , w⁰_(ij64)).

$\begin{matrix}{w_{ijm}^{0} = {{\min\limits_{k = 1,2,\ldots,29596}\left( x_{km} \right)} + {\left\{ {{\max\limits_{k = 1,2,\ldots,29596}\left( x_{km} \right)} - {\min\limits_{k = 1,2,\ldots,29596}\left( x_{km} \right)}} \right\}\frac{{rand}()}{2147483647}}}} & (32)\end{matrix}$
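
For comparison with the PCA-based initialization of Example 1, equation (32) can be sketched as follows. The NumPy generator stands in for the C rand( )/2147483647 scaling; the function name, the seed parameter, and the default lattice size (I=100, J=68, as in Example 1) are illustrative assumptions.

```python
import numpy as np

def init_neuron_vectors_random(x, I=100, J=68, seed=None):
    """Sketch of the random initialization of equation (32): each component
    w0_ijm is drawn uniformly between the minimum and maximum of the m-th
    variable over all input vectors."""
    rng = np.random.default_rng(seed)
    lo, hi = x.min(axis=0), x.max(axis=0)  # per-variable min / max over k
    return lo + (hi - lo) * rng.random((I, J, x.shape[1]))
```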

[0199] After defining W⁰_(ij), the genes of each of the 16 kinds of microorganism were classified according to the method in Example 1. The same analysis was then repeated.

[0200] The results of the two analyses are shown in FIGS. 3A and 3B. Furthermore, the relationship between the number of learning cycles (epochs) and the learning evaluation value Q in the first analysis is shown in FIG. 4.

[0201] As shown in FIGS. 3A and 3B, when random numbers are used, completely different SOMs are created for each analysis, and the degree of grouping is lower than in the case where principal component analysis is performed as in Example 1.

Reference Example 2

[0202] The classification analysis of the genes of each of the 16 kinds of microorganism was performed by carrying out the classification and updating of neuron vectors in steps (b) and (c) of the above-described Example 1 using the following equation (33) instead of equation (28).

W^(t+1)_(ij(k))=W^(t)_(ij(k))+α(t)(x_(k)−W^(t)_(ij(k)))   (33)

[0203] That is, the neuron vectors W^(t)_(ij(k)) were updated using the above equation (33) each time an input vector was classified in step (b) of Example 1.

[0204] The learning coefficient α(t) (0<α(t)<1) for the t-th epoch, when the number of learning cycles is set to T epochs, was obtained by the following equation (34), which is of the same form as equation (29). The present experiment was performed with the number of learning cycles being 100 (100 epochs). $\begin{matrix}{{\alpha (t)} = {\max \left\{ {0.01,{0.6\left( {1 - \frac{t}{T}} \right)}} \right\}}} & (34)\end{matrix}$

[0205] The neighboring neuron vectors of W^(t)_(ij(k)) were also updated according to equation (33) at the same time as W^(t)_(ij(k)). The neighborhood S_(ij) is the set of neuron vectors at lattice points i′j′ which satisfy the conditions i−β(t)≦i′≦i+β(t) and j−β(t)≦j′≦j+β(t). β(t) represents a number that determines the neighborhood, and is obtained by equation (35).

β(t)=max{0, 25−t}  (35)

[0206] The neuron vectors were updated by inputting, in order, from the input vector x₁ of the gene whose ID number is 1 in Table 1 to the input vector x₂₉₅₉₆ of the gene whose ID number is 29596, and updating from W^(t+1)_(ij(1)) to W^(t+1)_(ij(29596)) in order.
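
The order dependence described in this reference example follows directly from the update rule: every step modifies the neurons against which later inputs are compared. A minimal sketch of the sequential algorithm (the function name sequential_epoch and the 0-based indexing are assumptions):

```python
import numpy as np

def sequential_epoch(x, W, t, T=100, order=None):
    """One epoch of the conventional sequential update of equation (33):
    the winner and its neighborhood move toward x_k immediately after each
    classification, so the result depends on the presentation order."""
    I, J, _ = W.shape
    alpha = max(0.01, 0.6 * (1 - t / T))       # eq. (34)
    beta = max(0, 25 - t)                      # eq. (35)
    if order is None:
        order = range(len(x))                  # e.g. ID order 1 .. 29596
    for k in order:
        d = np.linalg.norm(W - x[k], axis=-1)  # distances to all neurons
        i, j = np.unravel_index(d.argmin(), (I, J))
        # Update the winner and its neighborhood toward x_k, eq. (33).
        i0, i1 = max(0, i - beta), min(I, i + beta + 1)
        j0, j1 = max(0, j - beta), min(J, j + beta + 1)
        W[i0:i1, j0:j1] += alpha * (x[k] - W[i0:i1, j0:j1])
    return W
```

Passing a reversed order corresponds to the second run described below; the two runs generally disagree, which is the order dependence this reference example demonstrates.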

[0207] The SOM obtained is shown in FIG. 5A.

[0208] The same analysis was performed with the neuron vectors W_(ij(k)) updated in the reverse order. That is, the input vectors were input in order from the input vector x₂₉₅₉₆ of the gene whose ID number is 29596 in Table 1 to the input vector x₁ of the gene whose ID number is 1, with updating performed from W^(t+1)_(ij(29596)) to W^(t+1)_(ij(1)) in order.

[0209] The SOM obtained is shown in FIG. 5B.

[0210] The degree of grouping is very low. In the SOM created in the input order from ID number 1, genes originating from the microorganisms of ID numbers 3 (Borrelia burgdorferi) and 5 (Chlamydia trachomatis) were not grouped at all. Furthermore, in the SOM created in the input order from ID number 29596, genes originating from ID numbers 1 (Archaeoglobus fulgidus), 4 (Bacillus subtilis), 5 (Chlamydia trachomatis), 9 (Methanococcus jannaschii) and 12 (Mycoplasma genitalium) were not grouped at all. Different SOMs are thus created depending on the input order of the data, so this conventional method is not appropriate for interpreting data in which the input order is meaningless.

Example 2 Analysis of Gene Expression Levels and Classification of Genes in Cancer Cell Lines

[0211] Data from the results of measuring the expression level of each gene in 60 cancer cell lines using a DNA microarray, described in "A Gene Expression Database for the Molecular Pharmacology of Cancer", Nature Genetics, 24, 236-244 (2000) (Uwe Scherf et al.), were analyzed using a method of the present invention. The 60 cancer cell lines are shown in Tables 3-1 and 3-2. The data were obtained as "all_genes.txt" from the web page "http://discover.nci.nih.gov/nature2000/" published by the authors of the paper.

[0212] Of the 10009 genes included in this file, the cDNAs of 5544 human genes were used for the analysis, excluding those genes having a description of "NA" or "−INF".
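
The filtering of the 10009 rows down to the usable genes might look as follows. This is a sketch only: the tab-separated layout with a leading identifier column is an assumption about the format of "all_genes.txt", and the function name is illustrative.

```python
import csv

def load_expression_vectors(path="all_genes.txt"):
    """Keep only genes whose rows contain neither "NA" nor "-INF",
    returning one expression vector (one value per cell line) per gene."""
    vectors = []
    with open(path) as f:
        for row in csv.reader(f, delimiter="\t"):
            if any(v in ("NA", "-INF") for v in row):
                continue                                 # exclude incomplete genes
            vectors.append([float(v) for v in row[1:]])  # col 0 assumed to be an ID
    return vectors
```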

[0213] The analysis of the data was performed based on the method in Example 1.

TABLE 3-1
Cell lines used for analysis by DNA microarray

Abbreviation of Cell Strain   Name of Cell Strain                Class
ME:LOXIMVI                    Melanoma line                          1
ME:MALME-3M                   Melanoma line                          2
ME:SK-MEL-2                   Melanoma line                          3
ME:SK-MEL-5                   Melanoma line                          4
ME:SK-MEL-28                  Melanoma line                          5
LC:NCI-H23                    Non-small-cell lung cancer cells       6
ME:M14                        Melanoma line                          7
ME:UACC-62                    Melanoma line                          8
LC:NCI-H522                   Non-small-cell lung cancer cells       9
LC:A549/ATCC                  Non-small-cell lung cancer cells      10
LC:EKVX                       Non-small-cell lung cancer cells      11
LC:NCI-H322M                  Non-small-cell lung cancer cells      12
LC:NCI-H460                   Non-small-cell lung cancer cells      13
LC:HOP-62                     Non-small-cell lung cancer cells      14
LC:HOP-92                     Non-small-cell lung cancer cells      15
CNS:SNB-19                    CNS lines                             16
CNS:SNB-75                    CNS lines                             17
CNS:U251                      CNS lines                             18
CNS:SF-268                    CNS lines                             19
CNS:SF-295                    CNS lines                             20
CNS:SF-539                    CNS lines                             21
CO:HT29                       Colon cancer lines                    22
CO:HCC-2998                   Colon cancer lines                    23
CO:HCT-116                    Colon cancer lines                    24
CO:SW-620                     Colon cancer lines                    25
CO:HCT-15                     Colon cancer lines                    26
CO:KM12                       Colon cancer lines                    27
OV:OVCAR-3                    Ovarian lines                         28
OV:OVCAR-4                    Ovarian lines                         29
OV:OVCAR-8                    Ovarian lines                         30

[0214]

TABLE 3-2
Cell lines used for analysis by DNA microarray

Abbreviation of Cell Strain   Name of Cell Strain                Class
OV:IGROV1                     Ovarian lines                         31
OV:SK-OV-3                    Ovarian lines                         32
LE:CCRF-CEM                   Leukemia                              33
LE:K-562                      Leukemia                              34
LE:MOLT-4                     Leukemia                              35
LE:SR                         Leukemia                              36
RE:UO-31                      Renal carcinoma lines                 37
RE:SN12C                      Renal carcinoma lines                 38
RE:A498                       Renal carcinoma lines                 39
RE:CAKI-1                     Renal carcinoma lines                 40
RE:RXF-393                    Renal carcinoma lines                 41
RE:786-0                      Renal carcinoma lines                 42
RE:ACHN                       Renal carcinoma lines                 43
RE:TK-10                      Renal carcinoma lines                 44
ME:UACC-257                   Melanoma line                         45
LC:NCI-H226                   Non-small-cell lung cancer cells      46
CO:COLO205                    Colon cancer lines                    47
OV:OVCAR-5                    Ovarian lines                         48
LE:HL-60                      Leukemia                              49
LE:RPMI-8226                  Leukemia                              50
BR:MCF7                       Breast origin                         51
BR:MCF7/ADF-RES               Breast origin                         52
PR:PC-3                                                             53
PR:DU-145                                                           54
BR:MDA-MB-231/ATCC            Breast origin                         55
BR:HS578T                     Breast origin                         56
BR:MDA-MB-435                 Breast origin                         57
BR:MDA-N                      Breast origin                         58
BR:BT-549                     Breast origin                         59
BR:T-47D                      Breast origin                         60

[0215] (a) Calculation of Input Vectors and Setting of Initial NeuronVectors

[0216] The above 5544 human genes are numbered in order (k=1, 2, . . . , 5544), and the input vectors x_(k)={x_(k1), x_(k2), . . . , x_(k60)} are set using the data of the expression level of each gene in the 60 cancer cell lines (m=1, 2, . . . , 60).

[0217] A first principal component vector b₁ and a second principal component vector b₂ are obtained by performing principal component analysis on the 5544 input vectors so defined. The results are as follows.

[0218] b₁=(0.0896, 0.1288, 0.1590, 0.1944, 0.1374, 0.1599, 0.1391, 0.1593, 0.1772, 0.0842, 0.0845, 0.0940, 0.1207, 0.0914, 0.1391, 0.0940, 0.0572, 0.0882, 0.1192, 0.0704, 0.0998, 0.1699, 0.1107, 0.1278, 0.1437, 0.1381, 0.1116, 0.0640, 0.0538, 0.0983, 0.1086, 0.1003, 0.2140, 0.1289, 0.2224, 0.2147, 0.0781, 0.1618, 0.0762, 0.0641, 0.0682, 0.0859, 0.0785, 0.0933, 0.1465, 0.0294, 0.1315, 0.1068, 0.1483, 0.1227, 0.1654, 0.1059, 0.0872, 0.1158, 0.1877, 0.0316, 0.2129, 0.2098, 0.0892, 0.1450)

[0219] b₂=(−0.0521, −0.1201, −0.1072, −0.0397, −0.1300, 0.0219, −0.1011, −0.1356, 0.0482, −0.0195, −0.0520, 0.0434, −0.0207, −0.1006, −0.1727, −0.1212, −0.1955, −0.1003, −0.1803, −0.1443, −0.1692, 0.1880, 0.1319, 0.0643, 0.1701, 0.1315, 0.1240, 0.0642, 0.0044, −0.0946, 0.0218, −0.0408, 0.2394, 0.2236, 0.2652, 0.1047, −0.1363, −0.1330, −0.1089, −0.1011, −0.1618, −0.1027, −0.1120, −0.0943, −0.0458, −0.0980, 0.2100, −0.0138, 0.2235, 0.1251, 0.1935, −0.0711, 0.0296, −0.0203, −0.1285, −0.2642, −0.1032, −0.0809, −0.1166, 0.0994)

[0220] Here, the standard deviations σ₁ and σ₂ of the first and second principal component values of the 5544 input vectors are 3.3367 and 2.0720 respectively, and the average x_(ave) of the input vectors is x_(ave)=(−0.0164, −0.0157, −0.0306, 0.0043, −0.0529, 0.0730, −0.0421, 0.0132, 0.0020, −0.0544, −0.0592, 0.0192, −0.0320, 0.0513, −0.0712, −0.0336, −0.0131, 0.0170, −0.1138, −0.1020, 0.0504, −0.1454, 0.0255, −0.0727, 0.0164, 0.0704, 0.0579, 0.0140, −0.0322, 0.0588, −0.0390, 0.0878, −0.0175, −0.1021, −0.1015, −0.0833, 0.0137, −0.1347, −0.0009, 0.0424, 0.0168, −0.0164, −0.0243, 0.0203, −0.0417, 0.0220, −0.0592, −0.0317, −0.0372, −0.1114, −0.1365, 0.0383, 0.0142, 0.0608, −0.1329, −0.0718, −0.1357, −0.0276, −0.0131, 0.0022).

[0221] Next, two-dimensional lattice points are represented by ij (i=1, 2, . . . , I, j=1, 2, . . . , J), and 60-dimensional neuron vectors W⁰_(ij)=(w⁰_(ij1), w⁰_(ij2), . . . , w⁰_(ij60)) are placed at the two-dimensional lattice points (ij). Here I is 50, and J is the largest integer less than I×σ₂/σ₁. In the present analysis, J turned out to be 31. W⁰_(ij) is defined by equation (36). $\begin{matrix}{W_{ij}^{0} = {x_{ave} + {5\sigma_{1}\left\{ {{b_{1}\left( \frac{i - {I/2}}{I} \right)} + {b_{2}\left( \frac{j - {J/2}}{J} \right)}} \right\}}}} & (36)\end{matrix}$
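
As a check of the lattice sizing rule, the reported J follows directly from the standard deviations given above: I×σ₂/σ₁ = 50×2.0720/3.3367 ≈ 31.05, and the largest integer less than 31.05 is J = 31.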

[0222] (b) Classification of Input Vectors into Neuron Vectors

[0223] Next, all of the 5544 input vectors x_(k) are classified into the neuron vectors W^(t)_(ij) with the smallest Euclidean distance. The neuron vector into which x_(k) is classified is represented by W^(t)_(ij(k)).

[0224] (c) Update of Neuron Vectors

[0225] Next, the neuron vectors W^(t)_(ij) are updated by the following equation (37). $\begin{matrix}{W_{ij}^{t + 1} = {W_{ij}^{t} + {{\alpha (t)}\left( {\frac{\sum\limits_{x_{k} \in S_{ij}}x_{k}}{N_{ij}} - W_{ij}^{t}} \right)}}} & (37)\end{matrix}$

[0226] The learning coefficient α(t) (0<α(t)<1) for the t-th epoch, when the number of learning cycles is set to T epochs, is obtained by equation (38). $\begin{matrix}{{\alpha (t)} = {\max \left\{ {0.01,{0.6\left( {1 - \frac{t}{T}} \right)}} \right\}}} & (38)\end{matrix}$

[0227] The neighboring set S_(ij) is the set of input vectors x_(k) classified into lattice points i′j′ which satisfy the conditions i−β(t)≦i′≦i+β(t) and j−β(t)≦j′≦j+β(t). Furthermore, N_(ij) is the total number of vectors classified into S_(ij). The symbol β(t) represents the number that determines the neighborhood, and is obtained by equation (39).

β(t)=max{0, 10−t}  (39)

[0228] (d) Learning Process

[0229] Next, the above steps (b) and (c) are repeated 100 (=T) times.

[0230] Here, the learning effectiveness for the t-th epoch is evaluated by the square error defined by the following equation (40). $\begin{matrix}{Q^{t} = {\sum\limits_{k = 1}^{N}\left\| {x_{k} - W_{{ij}(k)}^{t}} \right\|^{2}}} & (40)\end{matrix}$

[0231] Here, N (=5544) is the total number of genes, and W^(t)_(ij(k)) is the neuron vector with the smallest Euclidean distance from x_(k).

[0232] (e) Classification of Input Vectors into Neuron Vectors

[0233] All of the 5544 input vectors x_(k) are classified, with the smallest Euclidean distance, into the neuron vectors W^(T)_(ij(k)) obtained as a result of 100 epochs of learning. An SOM obtained by the classification is shown in FIG. 6. The numbers of genes classified into each neuron are also shown in FIG. 6.

[0234] In FIG. 7, the neuron vector at position [16, 29] (FIG. 7A) and all vectors of the genes classified into it (FIGS. 7B and 7C: genes encoded in clones of the EST entries of GenBank Accession Nos. T55183 and T54809, and genes encoded in the EST clones of Accession Nos. W76236 and W72999) are shown as bar charts. The figure makes clear that FIGS. 7B and 7C show vector patterns very similar to that of FIG. 7A.

[0235] That is to say, it has been shown that genes whose expression patterns are almost the same in human cells can be classified into the same neuron vector by a method of the present invention.

Industrial Applicability

[0236] According to the present invention, it is possible to create the same, reproducible self-organizing map from an enormous amount of input data, and to classify it and obtain useful information with high accuracy. Furthermore, the calculation processing time can be shortened considerably.

What is claimed is:
 1. A method for classifying input vector data with high accuracy by a nonlinear mapping method using a computer, which comprises the following steps (a) to (f): (a) inputting input vector data to a computer, (b) setting initial neuron vectors, (c) classifying an input vector into one of the neuron vectors, (d) updating neuron vectors so as to have a similar structure to structures of input vectors classified into the neuron vector and input vectors classified into the neighborhood of the neuron vector, (e) repeating step c and step d until a preset number of learning cycles is reached, and (f) classifying an input vector into one of the neuron vectors and outputting.
 2. The method according to claim 1, wherein the input vector data are data of K input vectors (K is a positive integer of 3 or above) of M dimensions (M is a positive integer).
 3. The method according to claim 1, wherein the initial neuron vectors are set by reflecting, on the arrangement or elements of the initial neuron vectors, the distribution characteristics of input vectors of multiple dimensions in multidimensional space, obtained by an unsupervised multivariate analysis technique.
 4. The method according to claim 3, wherein the unsupervised multivariate analysis technique is principal component analysis or multidimensional scaling.
 5. The method according to claim 1, wherein the classifying of an input vector into one of the neuron vectors is performed based on a similarity scaling selected from the group consisting of scaling of distance, inner product, and direction cosine.
 6. The method according to claim 5, wherein the distance is a Euclidean distance.
 7. The method according to claim 1 or 6, wherein the classifying of an input vector into one of the neuron vectors is performed using a batch-learning algorithm.
 8. The method according to claim 1, wherein the updating of a neuron vector so as to have a similar structure to structures of input vectors classified into the neuron vector and input vectors classified into the neighborhood of the neuron vector is performed using a batch-learning algorithm.
 9. The method according to claim 7 or 8, wherein the method is performed using parallel computers.
 10. A method of classifying input vector data with high accuracy using a computer, by a nonlinear mapping method, which comprises the following steps (a) to (f): (a) inputting K (K is a positive integer of 3 or above) input vectors x_(k) (k=1, 2, . . . , K) of M dimensions (M is a positive integer) represented by the following equation (1) to a computer, x_(k)={x_(k1), x_(k2), . . . , x_(kM)}  (1) (b) setting P initial neuron vectors W⁰_(i) (here, i=1, 2, . . . , P) arranged in a lattice of D dimensions (D is a positive integer) represented by the following equation (2), W⁰_(i)=F{x₁, x₂, . . . , x_(K)}  (2) (in which F{x₁, x₂, . . . , x_(K)} represents a function for converting from the input vectors {x₁, x₂, . . . , x_(K)} to initial neuron vectors) (c) classifying the input vectors {x₁, x₂, . . . , x_(K)} after t (t is the number of the learning cycle, t=0, 1, 2, . . . , T) learning cycles into one of P neuron vectors W^(t)_(1), W^(t)_(2), . . . , W^(t)_(P), arranged in a lattice of D dimensions, using similarity scaling, (d) for each neuron vector W^(t)_(i), updating the neuron vector W^(t)_(i) so as to have a similar structure to structures of input vectors classified into the neuron vector, and input vectors x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(Ni)(S_(i)) classified into the neighborhood of the neuron vector, by the following equation (3), W^(t+1)_(i)=G(W^(t)_(i), x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(Ni)(S_(i)))   (3) [in which x^(t)_(n)(S_(i)) (n=1, 2, . . . , N_(i)) represents N_(i) vectors (N_(i) is the number of input vectors classified into neuron i and the neighboring neurons) of M dimensions (M is a positive integer), and W^(t)_(i) represents P neuron vectors (t is the number of learning cycles, i=1, 2, . . . , P) arranged in a lattice of D dimensions (D is a positive integer); when the set of input vectors associated with the lattice points neighboring the lattice point where a specific neuron vector W^(t)_(i) is positioned, {x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(Ni)(S_(i))}, equals S_(i), the above equation (3) is an equation to update the neuron vector W^(t)_(i) to the neuron vector W^(t+1)_(i)]. (e) repeating step (c) and step (d) until a preset number of learning cycles T is reached, and (f) classifying the input vectors {x₁, x₂, . . . , x_(K)} into one of W^(T)_(1), W^(T)_(2), . . . , W^(T)_(P) using similarity scaling, and outputting a result.
 11. A method of classifying input vector data with high accuracy using a computer, by a nonlinear mapping method, which comprises the following steps (a) to (f): (a) inputting K (K is a positive integer of 3 or above) input vectors x_(k) (here, k=1, 2, . . . , K) of M dimensions (M is a positive integer) expressed by the following equation (4) to a computer, x_(k)={x_(k1), x_(k2), . . . , x_(kM)}  (4) (b) setting P (P=I×J) initial neuron vectors W⁰_(ij) arranged in a two-dimensional (i, j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J) by the following equation (5), $\begin{matrix}{W_{ij}^{0} = {x_{ave} + {5\sigma_{1}\left\{ {{b_{1}\left( \frac{i - {I/2}}{I} \right)} + {b_{2}\left( \frac{j - {J/2}}{J} \right)}} \right\}}}} & (5)\end{matrix}$

[in which x_(ave) represents the average value of the input vectors, b₁ and b₂ are the first principal component vector and the second principal component vector respectively obtained by principal component analysis on the input vectors {x₁, x₂, . . . , x_(K)}, and σ₁ denotes the standard deviation of the first principal component of the input vectors] (c) classifying the input vectors {x₁, x₂, . . . , x_(K)} after t learning cycles into one of P neuron vectors W^(t)_(1), W^(t)_(2), . . . , W^(t)_(P) arranged in a two-dimensional lattice (t is the number of learning cycles, t=0, 1, 2, . . . , T) using similarity scaling, (d) updating each neuron vector W^(t)_(ij) to W^(t+1)_(ij) by the following equations (6) and (7), $\begin{matrix}{W_{ij}^{t + 1} = {W_{ij}^{t} + {{\alpha (t)}\left( {\frac{\sum\limits_{x_{k} \in S_{ij}}x_{k}}{N_{ij}} - W_{ij}^{t}} \right)}}} & (6) \\{{\alpha (t)} = {\max \left\{ {0.01,{0.6\quad \left( {1 - \frac{t}{T}} \right)}} \right\}}} & (7)\end{matrix}$

[in which W^(t)_(ij) represents P (P=I×J) neuron vectors arranged on a two-dimensional (i, j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J) after t learning cycles, and the above equation (6) is an equation to update W^(t)_(ij) to W^(t+1)_(ij) so as to have a similar structure to structures of the input vectors (x_(k)) classified into the neuron vector and the N_(ij) input vectors x^(t)_(1)(S_(ij)), x^(t)_(2)(S_(ij)), . . . , x^(t)_(Nij)(S_(ij)) classified into the neighborhood of the neuron vector; the term α(t) designates a learning coefficient (0<α(t)<1) for the t-th epoch when the number of learning cycles is set to T epochs, and is expressed using a monotone decreasing function]. (e) repeating step (c) and step (d) until a preset number of learning cycles T is reached, and (f) classifying the input vectors {x₁, x₂, . . . , x_(K)} into one of W^(T)_(1), W^(T)_(2), . . . , W^(T)_(P) using similarity scaling, and outputting a result.
 12. A computer readable recording medium on which is recorded a program for performing the method according to any one of claims 1 to 11, which updates neuron vectors so as to have a similar structure to structures of input vectors classified into the neuron vector and input vectors classified into the neighborhood of the neuron vector.
 13. The recording medium according to claim 12, wherein said program is a program using a batch-learning algorithm.
 14. The recording medium according to claim 12 or 13, wherein said program is a program for performing the processing of the following equation (8): W^(t+1)_(i)=G(W^(t)_(i), x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(Ni)(S_(i)))   (8) [in which x_(k) (k=1, 2, . . . , K) represents K input vectors (K is a positive integer of 3 or above) of M dimensions (M is a positive integer), and W^(t)_(i) represents P neuron vectors (t is the number of learning cycles, i=1, 2, . . . , P) arranged in a lattice of D dimensions (D is a positive integer); when the set of input vectors associated with the lattice points neighboring the lattice point where a specific neuron vector W^(t)_(i) is positioned, {x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(Ni)(S_(i))}, is designated as S_(i), the above equation (8) is an equation to update the neuron vector W^(t)_(i) to the neuron vector W^(t+1)_(i).]
 15. The recording medium according to claim 12 or 13, wherein said program is a program for performing the processing of the following equations (9) and (10): $\begin{matrix}{W_{ij}^{t + 1} = {W_{ij}^{t} + {{\alpha (t)}\left( {\frac{\sum\limits_{x_{k} \in S_{ij}}x_{k}}{N_{ij}} - W_{ij}^{t}} \right)}}} & (9) \\{{\alpha (t)} = {\max \left\{ {0.01,{0.6\quad \left( {1 - \frac{t}{T}} \right)}} \right\}}} & (10)\end{matrix}$

[in which W^(t)_(ij) represents P (P=I×J) neuron vectors arranged in a two-dimensional (i, j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J) after t learning cycles, and the above equation (9) is an equation to update W^(t)_(ij) to W^(t+1)_(ij) so as to have a similar structure to structures of the input vectors (x_(k)) classified into the neuron vector and the N_(ij) input vectors x^(t)_(1)(S_(ij)), x^(t)_(2)(S_(ij)), . . . , x^(t)_(Nij)(S_(ij)) classified into the neighborhood of the neuron vector; the term α(t) designates a learning coefficient (0<α(t)<1) for the t-th epoch when the number of learning cycles is set to T epochs, and is expressed using a monotone decreasing function.]
 16. A computer readable recording medium on which is recorded a program for setting the initial neuron vectors in order to perform the method according to any one of claims 1 to 11.
 17. The recording medium according to claim 16, wherein said program is a program for performing the processing of the following equation (11): W⁰_(i)=F{x₁, x₂, . . . , x_(K)}  (11) [in which W⁰_(i) represents P initial neuron vectors arranged in a lattice of D dimensions (D is a positive integer), i is one of 1, 2, . . . , P, and F{x₁, x₂, . . . , x_(K)} is a function for converting the input vectors {x₁, x₂, . . . , x_(K)} to initial neuron vectors.]
 18. The recording medium according to claim 16, wherein said program is a program for performing the processing of the following equation (12): $\begin{matrix}{W_{ij}^{0} = {x_{ave} + {5\sigma_{1}\left\{ {{b_{1}\left( \frac{i - {I/2}}{I} \right)} + {b_{2}\left( \frac{j - {J/2}}{J} \right)}} \right\}}}} & (12)\end{matrix}$

[in which W⁰_(ij) represents P (P=I×J) initial neuron vectors arranged in a two-dimensional (i, j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J), x_(ave) is the average value of the K (K is a positive integer of 3 or above) input vectors {x₁, x₂, . . . , x_(K)} of M dimensions (M is a positive integer), b₁ and b₂ are a first principal component vector and a second principal component vector respectively obtained by principal component analysis on the input vectors {x₁, x₂, . . . , x_(K)}, and σ₁ is the standard deviation of the first principal component of the input vectors.]
 20. A computer readable recording medium on which is recorded a program for setting initial neuron vectors for performing the method according to any one of claims 1 to 11, and a program for updating neuron vectors so as to have a similar structure to structures of input vectors classified into the neuron vector and input vectors classified into the neighborhood of the neuron vector.
 21. The recording medium according to claim 20, wherein the program is a program for performing the processing of the following equations (13) and (14): W⁰_(i)=F{x₁, x₂, . . . , x_(K)}  (13) (in which W⁰_(i) represents P initial neuron vectors of D dimensions (D is a positive integer) arranged in a lattice, i is one of 1, 2, . . . , P, and F{x₁, x₂, . . . , x_(K)} is a function for converting from the K (K is a positive integer of 3 or above) input vectors {x₁, x₂, . . . , x_(K)} of M dimensions (M is a positive integer) to initial neuron vectors) W^(t+1)_(i)=G(W^(t)_(i), x^(t)_(1)(S_(i)), x^(t)_(2)(S_(i)), . . . , x^(t)_(Ni)(S_(i)))   (14) [in which x^(t)_(n)(S_(i)) (n=1, 2, . . . , N_(i)) represents N_(i) (N_(i) is the number of input vectors classified into neuron i and the neighboring neurons) input vectors of M dimensions (M is a positive integer), W^(t)_(i) represents P neuron vectors (t is the number of the learning cycle, i=1, 2, . . . , P) arranged in a lattice of D dimensions (D is a positive integer), and the above equation (14) is an equation to update W^(t)_(i) to W^(t+1)_(i) such that each neuron vector has a similar structure to structures of the N_(i) input vectors x^(t)_(n)(S_(i)) classified into the neuron vector].
 22. The recording medium according to claim 20, wherein the program is a program for performing the processing of the following equations (15), (16) and (17): $\begin{matrix}{W_{ij}^{0} = {x_{ave} + {5\sigma_{1}\left\{ {{b_{1}\left( \frac{i - {I/2}}{I} \right)} + {b_{2}\left( \frac{j - {J/2}}{J} \right)}} \right\}}}} & (15)\end{matrix}$

[in which W⁰_(ij) represents P (P=I×J) initial neuron vectors arranged in a two-dimensional (i, j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J), x_(ave) is the average value of the K (K is a positive integer of 3 or above) input vectors {x₁, x₂, . . . , x_(K)} of M dimensions (M is a positive integer), b₁ and b₂ are the first principal component vector and the second principal component vector respectively obtained by principal component analysis on the input vectors {x₁, x₂, . . . , x_(K)}, and σ₁ is the standard deviation of the first principal component of the input vectors] $\begin{matrix}{W_{ij}^{t + 1} = {W_{ij}^{t} + {{\alpha (t)}\left( {\frac{\sum\limits_{x_{k} \in S_{ij}}x_{k}}{N_{ij}} - W_{ij}^{t}} \right)}}} & (16) \\{{\alpha (t)} = {\max \left\{ {0.01,{0.6\quad \left( {1 - \frac{t}{T}} \right)}} \right\}}} & (17)\end{matrix}$

[in which W^(t)_(ij) represents P (P=I×J) neuron vectors (t is the number of learning cycles, t=1, 2, . . . , T) arranged in a two-dimensional (i, j) lattice (i=1, 2, . . . , I, j=1, 2, . . . , J), and the above equation (16) is an equation to update W^(t)_(ij) to W^(t+1)_(ij) such that each neuron vector has a similar structure to structures of the input vectors classified into the neuron vector and the N_(ij) input vectors x^(t)_(n)(S_(ij)) classified into the neighborhood of the neuron vector; the term α(t) denotes a learning coefficient (0<α(t)<1) for the t-th epoch when the number of learning cycles is set to T epochs, and is expressed using a monotone decreasing function.]
 23. The recording medium according to any one of claims 12 to 22, wherein the recording medium is a recording medium selected from floppy disk, hard disk, magnetic tape, CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM and DVD-RW.
 24. A computer based system using the computer readable recording medium according to any one of claims 12 to 23.